docs(experiments): add model experiments spec #3463
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Executive Summary
Docs-only PR adding the model-experiments spec; all four incremental commits are spec refinements with no logic or security concerns. The activation-minimum WARNING from the previous review is now resolved.
Incremental Changes Reviewed (all 4 commits)
Previous Issues — Status
Files Reviewed (2 files)
Reviewed by claude-sonnet-4.6 · 286,158 tokens
Review guidance: REVIEW.md from base branch
Model experiments exist only to A/B test preview or otherwise experimental model checkpoints in partnership with model providers. They are not a general-purpose traffic-splitting or rollout mechanism for production models.
An experimented `public_model_id` MUST be a dedicated preview or experiment id that users explicitly select. Production model ids MUST NOT be silently bucketed. Experimented ids MUST NOT be added to `kilo-auto` candidate sets, presets, or other automatic selection paths unless the spec is explicitly changed to allow that behavior.
WARNING: Scope exclusions are not enforced by the current routing/admin path.
Trace:
- The spec adds hard guarantees that experimented IDs are dedicated preview/experiment IDs and that `kilo-internal/...` traffic is outside model-experiment routing.
- Current admin/routing code only requires `public_model_id` to be a non-empty string; `assertActivatable` does not reject production IDs or `kilo-internal/...`; `getProvider` checks experiment membership before `kilo-internal/...` custom LLM routing.
- An admin can create/activate an experiment for a production model ID or `kilo-internal/...` ID, after which membership causes the experiment path to run despite the new spec saying those states MUST NOT happen.
Impact: The merged spec would claim opt-in/preview-only and kilo-internal/... exclusion invariants that current code does not preserve. A misconfigured active experiment can silently reroute production or internal-model traffic and capture experiment attribution/prompts under a contract that says that traffic is out of scope.
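One way the gap described in this warning could be closed is a scope guard run before creation or activation. The sketch below is illustrative only: `scopeViolations`, `ScopeCheckInput`, and the `productionModelIds` set are hypothetical names, not the repository's API, and how production ids would actually be identified is an assumption.

```typescript
// Hypothetical scope guard; none of these names exist in the repository.
const INTERNAL_PREFIX = "kilo-internal/";

interface ScopeCheckInput {
  publicModelId: string;
  // Assumption: the caller can supply the set of ids that are production models.
  productionModelIds: ReadonlySet<string>;
}

// Returns a list of human-readable problems; an empty list means the id is in scope.
function scopeViolations(input: ScopeCheckInput): string[] {
  const id = input.publicModelId.trim();
  const problems: string[] = [];
  if (id.length === 0) {
    problems.push("public_model_id must be a non-empty string");
  }
  if (id.startsWith(INTERNAL_PREFIX)) {
    problems.push("kilo-internal/... ids are outside model-experiment routing");
  }
  if (input.productionModelIds.has(id)) {
    problems.push("production model ids must not be experimented on");
  }
  return problems;
}
```

A caller could refuse to create or activate an experiment whenever the returned list is non-empty, which keeps the check in one place instead of relying on manual care.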
This is true; I scoped this out. I think we might want to reconsider how exactly we serve these experiments: it might make sense to combine them with a fallback (for when an experiment pauses, for instance) and put them behind another model (like kilo-auto/free). However, we make these changes manually, so I think we should always do this carefully.
- `paused`: gateway returns a local model-unavailable response for the experimented public id and MUST NOT silently fall through to default model routing.
- `completed`: historical and non-routing; removed from routing membership and eligible to coexist with a draft or active replacement for the same public id.
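The two lifecycle rules above can be read as a small decision function. This is a minimal sketch, not the gateway's code: `ExperimentStatus`, `RoutingDecision`, and `routeFor` are hypothetical names, and treating `draft` as non-routing is an assumption, since the quoted lines only define `paused` and `completed`.

```typescript
// Hypothetical types; the status names come from the spec excerpt, the rest is assumed.
type ExperimentStatus = "draft" | "active" | "paused" | "completed";

type RoutingDecision =
  | { kind: "experiment" }          // route through the experiment path
  | { kind: "model_unavailable" }   // local error, no fallthrough
  | { kind: "no_experiment" };      // id is not in routing membership

// `status` is undefined when no experiment targets the requested public id.
function routeFor(status?: ExperimentStatus): RoutingDecision {
  if (status === "active") return { kind: "experiment" };
  // Paused experiments must not silently fall through to default routing.
  if (status === "paused") return { kind: "model_unavailable" };
  // Completed (and, by assumption, draft) experiments are non-routing.
  return { kind: "no_experiment" };
}
```

Keeping `paused` as its own outcome, rather than folding it into the non-routing case, is what prevents the silent fallthrough the spec forbids.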
Activation MUST validate that the experiment has at least two variants, every variant has positive weight, every variant has a current version effective at or before activation time, and no other active or paused experiment targets the same public id.
WARNING: Activation minimum conflicts with current behavior.
Trace:
- The spec requires activation to validate at least two variants.
- Current implementation rejects only `variants.length < 1`; the admin UI says “at least 1 variant”; and `model-experiments-router.test.ts` explicitly activates a one-variant experiment successfully.
- The repository already permits and tests a one-variant active experiment, while the new source-of-truth spec says activation MUST reject it.
Impact: Future reviewers/implementers will rely on a false invariant. One-variant experiments can be activated today, so the experiment may not be an A/B test and cohort/reporting assumptions based on “at least two variants” can be wrong.
Oh, nice catch, this is actually wrong. Though we started out with A/B testing, I also want this to be used for purely sequential testing.
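For reference, the activation checks discussed in this thread could be sketched as follows, with the variant minimum passed in as a parameter so that either reading (one variant for sequential testing, two for A/B) can be expressed. All names here (`Variant`, `ExperimentDraft`, `ActivationContext`, `activationErrors`) are hypothetical and do not reflect the repository's `assertActivatable`.

```typescript
// Hypothetical shapes for illustration only.
interface Variant {
  weight: number;
  // Effective time (ms since epoch) of the variant's current version, if one exists.
  currentVersionEffectiveAt?: number;
}

interface ExperimentDraft {
  publicModelId: string;
  variants: Variant[];
}

interface ActivationContext {
  activationTime: number;
  // Public ids already targeted by an active or paused experiment.
  occupiedPublicIds: ReadonlySet<string>;
  // 1 permits sequential testing; 2 enforces a strict A/B minimum.
  minVariants: number;
}

// Returns every failed check; an empty list means activation may proceed.
function activationErrors(exp: ExperimentDraft, ctx: ActivationContext): string[] {
  const errors: string[] = [];
  if (exp.variants.length < ctx.minVariants) {
    errors.push(`needs at least ${ctx.minVariants} variant(s)`);
  }
  exp.variants.forEach((v, i) => {
    // `!(w > 0)` also rejects NaN weights.
    if (!(v.weight > 0)) errors.push(`variant ${i} must have positive weight`);
    const at = v.currentVersionEffectiveAt;
    if (at === undefined || at > ctx.activationTime) {
      errors.push(`variant ${i} has no current version effective at activation time`);
    }
  });
  if (ctx.occupiedPublicIds.has(exp.publicModelId)) {
    errors.push("another active or paused experiment targets this public id");
  }
  return errors;
}
```

Collecting all failures rather than throwing on the first lets an admin UI show every problem at once; whether the real implementation does this is not stated in the thread.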
Summary
- Add `.specs/model-experiments.md` as the durable source of truth for model experiment scope, routing, lifecycle, retention, reporting, and secret-handling rules.
- Update `AGENTS.md` so future model-experiment changes read the business rules before editing the domain.
Verification
Visual Changes
N/A
Reviewer Notes